The Evolution of Large Language Model Architectures: From BERT to GPT and T5

Three Paths of the Transformer Architecture

The evolution of large language models is marked by one central paradigm shift: the move from task-specific models to "unified pre-training," in which a single architecture can be adapted to many natural language processing tasks.

At the core of this shift is the self-attention mechanism, which lets the model weigh the importance of each word in a sequence relative to the others:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
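
The formula can be traced by hand on a toy input. The sketch below is a minimal pure-Python illustration, not an optimized implementation: the function names `softmax` and `attention` and the example matrices are assumptions made for this example, and a real model would use batched tensor operations and learned projections to produce Q, K, and V.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on small list-of-lists matrices.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (output, weights): output is n_q x d_v, weights is n_q x n_k.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
        for q_row in Q
    ]
    # Each row of weights is a probability distribution over the keys.
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the rows of V.
    d_v = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy example: 3 tokens with d_k = d_v = 2 (values are made up).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, so every output vector is a convex combination of the value vectors.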

1. Encoder-Only (BERT)

Mechanism: Masked Language Modeling (MLM).
Behavior: Bidirectional context; the model sees the entire sentence at once to predict the masked words.
Best for: Natural Language Understanding (NLU), sentiment analysis, and Named Entity Recognition (NER).
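
The masking step behind MLM can be sketched as follows. This is a simplified illustration: the helper `mask_tokens` is hypothetical, the sentence is made up, and the positions are chosen by hand. BERT's actual procedure samples roughly 15% of positions at random and sometimes keeps or swaps the selected token instead of always inserting `[MASK]`.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Replace the tokens at the given positions with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to the original token, which is what the model is trained to predict.
    """
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = masked[i]
        masked[i] = MASK
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, [1, 4])
print(masked)   # the model sees context on both sides of each [MASK]
print(labels)   # training targets for the masked positions
```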

2. Decoder-Only (GPT)

Mechanism: Autoregressive modeling.
Behavior: Left-to-right processing; predicts the next token from the preceding context only (causal masking).
Best for: Natural Language Generation (NLG) and creative writing. This is the foundation of modern large language models such as GPT-4 and Llama 3.
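
Causal masking can be illustrated with a small sketch. The helpers `causal_mask` and `masked_softmax` are names assumed for this example, and the uniform scores are made up so that the effect of the mask is easy to read. Production code typically adds a large negative value to the masked scores instead of filtering them.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives masked-out positions zero weight."""
    out = []
    for s_row, m_row in zip(scores, mask):
        allowed = [s for s, ok in zip(s_row, m_row) if ok]
        m = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(s_row, m_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform attention scores for 4 positions (illustrative values only).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, the first position places all of its weight on itself, while the last position spreads its weight evenly over all four positions. No position assigns weight to a later token.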

3. Encoder-Decoder (T5)

Mechanism: Text-to-Text Transfer Transformer.
Behavior: The encoder maps the input sequence to a dense representation, and the decoder generates the target sequence.
Best for: Translation, summarization, and other sequence-to-sequence tasks that map one text to another.
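
The text-to-text framing means that the task itself is written into the input string. The sketch below shows the idea with task prefixes in the style used by T5. The dictionary keys and the helper `to_text_to_text` are hypothetical names for this example, and no model is called.

```python
# Task prefixes in the style used by T5; the dictionary keys are hypothetical.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def to_text_to_text(task, text):
    """Turn a (task, input) pair into one input string for the encoder."""
    return PREFIXES[task] + text

print(to_text_to_text("translate_en_de", "That is good."))
print(to_text_to_text("summarize", "A long article ..."))
```

Because every task shares the same string-in, string-out interface, one model and one training objective can cover all of them.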

Key Insight: The Dominance of the Decoder

The industry has largely converged on decoder-only architectures because of their favorable scaling laws and the emergent reasoning abilities they show in zero-shot settings.

Impact of the Context Window on VRAM

In decoder-only models, the KV cache grows linearly with sequence length. A 100,000-token context window requires significantly more VRAM than an 8,000-token window, which makes local deployment of long-context models difficult without quantization.
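
The linear growth can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a hypothetical configuration (32 layers, 8 key/value heads, head dimension 128) that does not describe any specific model. It counts only the cache itself, not the weights or activations, and it models quantization simply as storing each cached element in 1 byte instead of 2.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical model configuration, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (8_000, 100_000):
    fp16 = kv_cache_bytes(n, **cfg)                    # 2 bytes per element
    int8 = kv_cache_bytes(n, bytes_per_elem=1, **cfg)  # assumed 8-bit cache
    print(f"{n:>7} tokens: fp16 {fp16 / 2**30:.2f} GiB, int8 {int8 / 2**30:.2f} GiB")
```

Under these assumptions the 16-bit cache needs about 1 GiB at 8,000 tokens and about 12 GiB at 100,000 tokens, a 12.5x increase that matches the ratio of the sequence lengths. Halving the bytes per element halves both figures.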

Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

Decoders scale more effectively for generative tasks and instruction following via next-token prediction.

Encoders cannot process text bidirectionally.

Decoders require less training data for classification tasks.

Encoders are incompatible with the Self-Attention mechanism.

Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

Encoder-Only (BERT)

Decoder-Only (GPT)

Encoder-Decoder (T5)

Recurrent Neural Networks (RNN)

Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a Decoder-only model might be preferred over an Encoder-Decoder model in modern deployments.

Step 1

Identify the architectural bottleneck regarding context processing.

Solution:
Encoder-decoder models must run the entire long input through the encoder and then perform cross-attention over it in the decoder, which can be computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process the input and the generated output as one token sequence through a single stack. With modern techniques such as FlashAttention and KV cache optimization, scaling the context window in a decoder-only model is more streamlined and efficient for real-time generation.

Step 2

Justify the preference using Scaling Laws.

Solution:
Decoder-only models have shown highly predictable performance gains (scaling laws) as parameters and training data increase. At sufficient scale this unlocks "emergent abilities," allowing a single decoder-only model to perform zero-shot summarization effectively without the task-specific fine-tuning that smaller encoder-decoder setups often require.